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ABSTRACT The additive relationship matrix plays an important role in mixed model prediction of breeding 
values. For genotype matrix X (loci in columns), the product XX' is widely used as a realized relationship 
matrix, but the scaling of this matrix is ambiguous. Our first objective was to derive a proper scaling such 
that the mean diagonal element equals 1 +f, where f is the inbreeding coefficient of the current population. 
The result is a formula involving the covariance matrix for sampling genomic loci, which must be esti- 
mated with markers. Our second objective was to investigate whether shrinkage estimation of this co- 
variance matrix can improve the accuracy of breeding value (GEBV) predictions with low-density markers. 
Using an analytical formula for shrinkage intensity that is optimal with respect to mean-squared error, 
simulations revealed that shrinkage can significantly increase GEBV accuracy in unstructured populations, 
but only for phenotyped lines; there was no benefit for unphenotyped lines. The accuracy gain from 
shrinkage increased with heritability, but at high heritability (> 0.6) this benefit was irrelevant because 
phenotypic accuracy was comparable. These trends were confirmed in a commercial pig population with 
progeny-test-estimated breeding values. For an anonymous trait where phenotypic accuracy was 0.58, 
shrinkage increased the average GEBV accuracy from 0.56 to 0.62 (SE < 0.00) when using random sets of 
384 markers from a 60K array. We conclude that when moderate-accuracy phenotypes and low-density 
markers are available for the candidates of genomic selection, shrinkage estimation of the relationship 
matrix can improve genetic gain. 
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Mixed models play an important role in the prediction of breeding 
values for plants and animals. Under the assumption that the breeding 
values are multivariate normal with genetic covariance G, best linear 
unbiased prediction (BLUP) can be used to calculate the breeding 
values from phenotypic data (Henderson 1984; Bernardo 2010). In 
the absence of molecular markers, genetic covariance can be estimated 
via the approximation 

G = Ao-i (1) 
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where each element of the numerator relationship matrix A is 
twice the coefficient of coancestry and depends on the probability of 
identity-by- descent (IBD) from a base population with additive ge- 
netic variance cr\ (Kempthorne 1957; Lynch and Walsh 1998). We 
regard Equation 1 as an approximation because it depends on a num- 
ber of population genetic assumptions that rarely hold in breeding 
populations, particularly in plant breeding (Goddard 1986; Lynch and 
Walsh 1998; Piepho et al. 2008). 

When molecular markers are available, it is often assumed that the 
goal is to estimate the probability of IBD, but in fact, the goal is to 
estimate the genetic covariance, which depends on the genotypes of 
the causal loci and is fundamentally a state property. It follows that for 
a complex trait for which the infinitesimal model is a suitable ap- 
proximation, G depends on the probability that the alleles at a random 
locus are identical in state, or IBS (Yang et al. 2010; Powell et al. 2010). 
Our first objective was to develop a theoretical framework for esti- 
mating the (realized) relationship matrix that is suitable for inbred 
lines and consistent with the IBS approach [i.e. without invoking a base 
population). 

As the number of markers increases, the probability of IBS at a 
random marker approaches the probability of IBS at random genomic 
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loci. This limiting behavior is the basis for estimators of the form 
(Habier et al. 2007) 

G = 6XX' (2) 

where X is the «xm genotype matrix for m markers and n Unes, 
and the proportionality constant 6 is fit by maximum likelihood (or 
REML). To see this connection more explicitly, note that when m bi- 
aUelic markers are coded as { — 1,0,1}, the nxn matrix of IBS coeffi- 
cients (for the markers) is ^(m^'XX' + J), where J is a matrix of ones 
(Piepho 2009). 

Although Equation 2 is sufficient for breeding value prediction, to 
define a realized relationship matrix, a convention is needed concern- 
ing the scaling of the matrix. By analogy with the numerator relation- 
ship matrix, we propose that the mean of the diagonal elements equals 
1+/, where /is the inbreeding coefficient of the current population 

{An) = 1 +/ (3) 

(Beginning with Equation 3, the symbol A denotes the IBS re- 
lationship matrix, and the angular brackets denote the average with 
respect to an index, in this case, /.) Equation 3 requires a concept for 
inbreeding that is consistent with the IBS framework. Following 
Powell et al. (2010), we define the inbreeding coefficient for a single 
locus as the intra-individual gametic correlation, but our extension to 
the multi-locus case is different and emerges as an algebraic necessity 
during the derivation. 

The strategy embodied in Equation 2, in which the IBS properties 
of the markers are used as a proxy for the IBS properties of any two 
genomic loci, requires the number of markers to be much larger than 
the number of lines (m S> n). However, to minimize genotyping costs 
in breeding programs, it is common to use low-density (e.g. 384) SNP 
arrays, in which case the number of lines may exceed the number of 
markers. To develop a suitable estimator for this situation, we express 
the realized relationship matrix in terms of the nxn variance-covari- 
ance matrix (2) for genomic loci [i.e. when sampling columns of the 
genotype matrix). Equation 2 is equivalent to estimating X with the 
sample covariance S, which in the large m limit is asymptotically 
optimal with respect to mean-squared error (MSE) (Casella and 
Berger 2002). 

When the number of lines exceeds the number of markers, the 
MSE of the sample covariance matrix is no longer optimal because 
there are too many parameters to estimate [rp-ll) relative to the num- 
ber of marker data points [nm). This type of phenomenon is well 
known in the statistics literature under the name Stein's paradox 
(Stein 1956; Efron 1975), and it was James and Stein (1961) who first 
proposed shrinkage to reduce the MSE. Yang et al. (2010) have pro- 
posed a shrinkage estimator for the realized relationship matrix, but it 
does not preserve Equation 3. We propose an alternative estimator 
that does not shrink the inbreeding coefficient, and we investigate its 
impact on the accuracy of breeding value prediction in rice, barley, 
maize, and pig poptdations. 

THEORY 

Derivation of A in terms of causal loci 

Initially we work with causal loci, using an overscript tilde for 
variables. Eventually we will work with a marker matrix and use the 
same symbols without the tilde. Consider a trait with fn causal bi- 
aUelic loci, where Xn e {0,1,2} is the allele content at locus k in line i 



(the assignment of alleles is arbitrary). For an additive trait, the genetic 
value of line / can be written as 

m 

'^■=^i^ik-^Pk)'^k = Yl Uk (4) 

k=l k 

where the random genetic effects are taken to be physiolog- 
ical parameters of the causal loci, and W is the centered geno- 
type matrix. By centering the allele content of each locus 
(pp. = (2n) Xik are the allele frequencies), the genetic values 
are expressed relative to the population mean a,- = 0). Fur- 
thermore, as proved in Appendix i, the additive genetic value in 
Equation 4 equals the breeding value, that is, twice the mean prog- 
eny value when the current population (regardless of its structure) 
is randomly mated. 

As in the introduction, we denote the variance-covariance matrix 
for the breeding values by G 

G = var„[a] = var„ [Wu] = Wvar[u]W' = crjm^'ww' (5) 

The subscript u on the variance operator indicates that it is with 
respect to the random genetic effects and not the genotypes — the 
latter are simply given and not assumed to follow any distribution. 
For the last step in Equation 5, we have assumed the ilk i.i.d. with 
constant variance cr^/fh, which is appropriate for a complex trait with 
many causal loci of comparable effect {i.e. well described by the in- 
finitesimal model). The variance per locus is scaled by m so that cr^ is 
an intensive property that does not depend on the number of causal 
loci. 

We are now in a position to decompose G as Acr^, where A is the 
(IBS) relationship matrix satisfying Equation 3. This convention for 
the scaling of A implies that 

tr(G) = o-2tr(A)=o-2n(l+/) (6) 

where the trace operator, tr(»), sums the diagonal elements of 
a matrix. Genetic formulas for both the population inbreeding co- 
efficient / and the variance parameter o^ emerge upon applying the 
trace operator to Equation 5 

m n 2 

tr(G) = alm-'Y.Jl(^'k-2Pk) (7) 

k=l i=l 

in n 2 

= O-^*"^ + {Xik2-Pk)] (8) 

k=l i=l 

m 

k=l 

Equation 7 follows from the identity tr(ZZ') = '^^ik'^fk' which 
holds for any matrix Z (Searle 1971). Equation 8 follows by writing 
the diploid genotype as the simi of its two gametes: X,vt = Xjki + Xik2, 
where Xjki and Xjk2 are binary variables. Equation 9, which follows 
from several algebraic manipulations, introduces the notation q = 1-p 
as well as the intra-individual gametic correlation /j. at a single locus 
(Powell et al. 2010) 



1406 I J. B. Endelman and J.-L. Jannink 



:-5iG3'Genes | Genomes | Genetics 



^T.i{^ikl-Pk){xik2-Pk) 

Pk^k 



(10) 



In Appendix 2, we show that^ is also the deviation from Hardy- 
Weinberg proportions and thus interpretable as the inbreeding co- 
efficient for the population. 

Upon comparing Equation 6 with Equation 9, we see that the 
coefficient of the A matrix is 



2°"'> 'Y^khh = ^(^liMk) 



(11) 



and the inbreeding coefficient / is a weighted average across loci 



/ = Pkfk 

a _ Mk 



(12) 



Dividing G (Equation 5) by a-^ (Equation 11) yields the following 
formula for the relationship matrbc 



A = 



WW' 

2e;m 



'^fn-^Y.jPflj 



(13) 



As expected, the parameter a-^ cancels out and does not appear in 
the formula for A. The second step in Equation 13 is an identity from 
matrbc algebra, in which WW' has been written as a sum over m 
matrices with dimension nxn, formed from the outer product of the 
columns of W (denoted by W.iJ. In the limit of the infinitesimal 
model, the averages in Equation 13 converge to the corresponding 
expected values under random sampling of genomic loci. Letting the 
random variables w and p denote the centered genotype and allele 
frequency, respectively, the result is 



A = 



E[ww'] var[w] + E[w]E[w'] 
2EN ' 2^pq] 



(14) 



By shrinking the estimate, although this introduces bias, the esti- 
mation error can be decreased. One type of shrinkage estimator is the 
weighted average 



2 = ST+(1-5)S 



(16) 



where the shrinkage intensity S ranges from 0 to 1. When 8 = 0, 
there is no shrinkage and the estimator equals the sample covariance 
S. When 5 = 1, the estimate is completely shrunk to a target T. The 
target represents a low-dimensional model that can be estimated with 
greater precision than 2 because it has fewer parameters (Schafer and 
Strimmer 2005). 

A common target is T = (S,-,)! where (S,,) is the mean of the 
diagonal elements of S (Ledoit and Wolf 2004). Substituting this for- 
mula into Equation 16, one can verily that the total variance is estimated 
without bias: E[tr(2)] = tr(2). Shrinkage is not needed for estimating 
the total variance because it is a single parameter. Similarly, even with as 
few as m = 96 markers, the row means of the marker matrix (W. wiU 
be a near-optimal estimator for the genome-wide parameter E[w] be- 
cause there are only n parameters to estimate from nm data points. Our 
shrinkage estimator (denoted with an asterisk) is thus 



d{Sii)l+{l-8)S+{W.k)(W'.k) 
^{Pjlj) 



(17) 



To select the shrinkage intensity, we make use of results from 
Ledoit and Wolf (2004), who derived an analytical formula for the 
shrinkage intensity that minimizes the expected MSE for the covari- 
ance matrix 



argmin e[||2-2| 



(18) 



(The squared Frobenius norm is the sum of the squared 
elements of a matrix.) In Methods, we give the asymptotic solution 
to Equation 18. From this solution, a usefiil heuristic can be derived 
for when shrinkage is expected to be negligible (Ledoit and Wolf 
2004) 



8~ 



n/m 

{cvf 



(19) 



Estimating A from markers 

The nxm-centered genotype matrix for the markers W (without 
a tilde) represents m realizations of the random variable w and can 
be used to estimate the parameters in Equation 14, the most important 
being the genome-wide covariance matrix 2 = var[w]. If the markers 
are an unbiased sample of genomic loci, then the sample covariance 
matrix S = m^'WW' — (W.j:)(Wl;c) is an unbiased estimator of the 
genomic parameter 2. Moreover, when the number of markers is 
large compared to the number of lines (m ^ «), S is optimal with 
respect to mean-squared error (CaseUa and Berger 2002). This leads to 
the following estimator for A 



'^Y.kPkqk 

As the number of markers decreases, the sample covariance is no 
longer an optimal estimator for the genome-wide covariance matrix. 



where CV is the coefficient of variation of the eigenvalues of S. 
Equation 19 indicates that when the ratio of lines to markers is small 
compared with the dispersion of eigenvalues {nim <C CV^), there is no 
need for shrinkage. This formula is revisited in Results. 

IVIETHODS 
Data sets 

Genotypes for several publicly available populations were used in this 
study: 

(1) Maize diversity panel (Cook et al. 2012) (available at http://www. 
panzea.org/dynamic/derivative_data/Cook_etal_2012_SNP50K_ 
maize282_AGPvl-111202.zip) 

(2) Rice diversity panel (Zhao et al. 2011) (available at ftp://ftp. 
gramene.org/pub/gramene/CURRENT_RELEASE/data/diversity/ 
data_download/hapmap_plink_files/div_rice34.RiceDiversity44K. 
hapmap.tar.gz) 
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(3) Commercial pig population (Cleveland et al. 2012) (available at 
http://www.g3journal.Org/content/suppl/2012/04/06/2.4.429.DCl/ 
FileSl.zip) 

(4) Advanced breeding lines from the North Dakota State Univer- 
sity 2006-2009 two-row and six-row barley breeding programs 
(available by querying the database at http://hordeumtoolbox. 
org) 

For the pig population, we also used phenotypes and progeny-test- 
estimated breeding values (pEBV) for three anonymous traits, down- 
loaded from the same source. Genotypes were curated by eliminating 
markers with more than 10% missing data and lines with more than 
15% missing data. The number of lines and markers after curation are 
shown in Table 1. Missing marker scores were imputed with the 
population mean for each marker. 

Shrinkage intensity 

Let the «xm matrix Z constitute m independent observations of an n- 
variate random variable z with mean 0, for which the sample covari- 
ance matrix is S = m~'ZZ'. Ledoit and Wolf (2004) proved that the 
following shrinkage intensity produces an estimator that is asymptot- 
ically optimal with respect to MSB (Equation 18) 

m-^EIIZ-iZ'-i-Sll' 
S = — . (20) 

l|s-(s,-,)if 

For convenience, we rewrite the numerator in Equation 20 as 

^||Z.;,Z'.t-Sf = tr[(z.tZ'.fc-S)'] (21) 

k k 

= ^ tr(z.;tZ'.;tZ.iZ'.0 - m tr(S^) (22) 

k 

= ^^4z^-mtr(s2) (23) 

k ij 

= mJ2{r.j-S^^ (24) 
•i 

Equations 22 and 23 follow from cyclic permutation properties of 
the trace and the definition of S. Equation 24 introduces the matrix 

r = m-'[Zi][Zi]' (25) 

where [Z?j.] is the matrix formed by squaring the elements of Z. In 
practice, the shrinkage intensity is confined to the interval [0,1] when 
Equation 20 yields a value outside this range (Ledoit and Wolf 2004). 
When applying these formulas to the genomic data, for Z we used the 
centered genotype matrix W adjusted to have zero row means: 

Zik = Wik - m-'J2kWik- 

This shrinkage algorithm has been implemented as part of the 
rrBLUP package for R, version 4.0 (Endelman 2011; R Development 
Core Team 2011). 

Simulation and analysis 

Simulated traits were constructed by first generating additive genetic 
values from the multivariate normal distribution, with variance equal 



to the full-marker relationship matrix (hence, = 1). Independent 
normal deviates with variance cr^ were added to generate phenotypes, 
and the o-^ parameter was modulated to simulate traits with different 
phenotypic accuracies. Figure 3 was generated with crj = 3 for the 
plant species and cr^ = 2 for the pigs. Figure 4 is based on 10,000 
simulations, with log2cr^ chosen from a uniform(— 1,7) distribution, 
and results were binned by reahzed phenotypic accuracy in 0.1 
increments. 

Mixed model prediction of breeding values was conducted with the 
model 

y = /al + a + E (26) 

where y is the vector of phenotypes, /j, is a fixed effect, 1 is a col- 
umn vector of ones, a ~ N{0, A<t^) is the vector of breeding values 
with estimated relationship matrix A, and the residuals are 
£~W(0,I(T^). Computations were done with R package rrBLUP 
(Endelman 2011), which estimates variance components by REML 
using the eigenvalue decomposition algorithm of Kang et al. (2008). 
With this algorithm, the inverse phenotypic covariance matrix V is 
readily generated, after which the BLUE and BLUP solutions for the 
fixed and random effects, respectively, can be calculated using stan- 
dard formulas (Searle et al. 1992) 

fji= (^l'V~'lj \'V~'y 
a = o^AV~\y-Al) 

Accuracy was defined as the Pearson correlation coefficient 
between the genomic estimated breeding values (GEBV = a) and 
either the true breeding values (in the simulation) or the progeny- 
test-estimated breeding values (pEBV) for the pig traits. 

RESULTS 

Table 1 lists several attributes of the five populations used in this 
study. The population sizes ranged from n = 274 (maize) to n = 
3534 (pig). The pig, maize, and rice populations had 30~50K SNPs, 
whereas only 2K SNPs were available for the barley populations. The 
inbreeding coefftcient (J) for each of the four plant populations, cal- 
culated from the mean diagonal element of the relationship matrix, 
was near 1 as expected for inbred lines (imputing missing markers 
with the population mean introduced low levels of heterozygosity). 
The pig population was outbred with/= 0.03. 

Both structured and unstructured populations were included. The 
rice population was a diverse panel of several distinct types [indica, 
japonica, Aus, etc) identifiable with principal component (PC) anal- 
ysis (Zhao et al. 201 1). The observation that 34% of the total variation 

Table 1 Populations 



Population Lines (n) SNPs (m) 1« PC'' CV<^ n/CM^^ 



Pig 


3534 


52,843 


0.03 


0.06 


4.7 


1 


Maize 


274 


44,431 


0.97 


0.05 


1.3 


1.01 


2-row Barley 


383 


2398 


0.95 


0.08 


2.6 


0.35 


2-i-6-row Barley 


763 


1884 


0.97 


0.32 


9.0 


0.06 


Rice 


407 


31,443 


0.96 


0.34 


7.2 


0.05 



^ Inbreeding coefficient, estimated from the relationship matrix. 

Fraction of total variance captured by the first principal component (PC). 

Coefficient of variation (1 = 100%) for the eigenvalues of the covariance 
^matrix. 

Quantities are relative to the pig population (= 1). 
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Figure 1 Histograms of entries in the realized re- 
lationship matrix for the 2-row and 2-i-6-row barley 
populations. The diagonal elements have a mean of 
1 + f « 2 for inbred lines, while the off-diagonal 
elements have a mean of —(1 + fl/n « 0. The bi- 
modal distribution of the off-diagonal elements 
reveals the highly structured nature of the 2+6-row 
barley population. The positive peak contains rela- 
tionships between lines with the same row number, 
while the negative peak is between lines with differ- 
ent row numbers. 



Additive relationship 



was captured by the first PC indicates its higlily structured nature (l'^* 
PC in Table 1). We intentionally grouped the 2-row and 6-row barley 
lines, which as separate populations are unstructured (P' PC < 10%) 
and derived from different breeding programs, into one population to 
create a second structtired population for analysis (32% explained by 
P' PC). The pig and maize populations were relatively unstructured 
(P' PC < 10%). 

Population structure can also be detected from a histogram of the 
realized relationship coefficients. Figure 1 contrasts the unstructured 
2-row barley population with the structured 2-h6-row population. 
Because the relationship coefficients are expressed relative to the cur- 
rent population, the mean of the off-diagonal elements (left panel) is 
— (1 + j)/n, which is essentially 0 for populations with himdreds of 
lines or more. Despite having the same mean, the histogram for the 2- 
row population is tmimodal, whereas that for the 2+6-row population 
is bimodal. The positive peak in the bimodal distribution arises from 
relationships between lines with the same row number, while the 
negative peak corresponds to relationships between lines with differ- 
ent row numbers. The highly structured rice population also has 
a diffuse distribution of off-diagonal elements, whereas the pig and 
maize distributions are unimodal (supporting information, Figiu-e SI). 

The right panel in Figtire 1 shows the distribution of diagonal 
elements in the realized relationship matrices for the 2-row and 2 
+6-row barley populations. Although the mean of the diagonal ele- 
ments is 1+/ and thus at most 2, the individual coefficients can be 
larger than 2, unlike the diagonal elements of the mmierator relation- 
ship matriK. The interpretation of the diagonal coefficients in terms of 
inbreeding is discussed below. 

Shrinkage to minimize MSE 

For each of the five populations, relationship matrices were estimated 
from random subsets of markers, with the shrinkage intensity chosen 
to minimize the expected MSE. As shown in Figure 2, for every 
population, the shrinkage intensity approached zero as marker ntmi- 
ber increased, but there were clear differences in the amount of 
shrinkage at low marker density. With 384 markers, the two struc- 
tured populations (rice and 2+6-barley) had less than 3% shrinkage 
compared with nearly 20% shrinkage for the 2-row barley and over 
30% shrinkage for the maize and pig populations. 

These trends can be understood in terms of the heuristic in Equa- 
tion 19, in which (for a given marker density) the shrinkage intensity 
depends on the ratio nlCV^ between population size («) and the co- 
efficient of variation (CV) for the eigenvalues of the nxn covariance 
matrix. Because the leading principal components in a structured 



population account for a large amount of the total variation, such 
populations have high eigenvalue CV. As shown in Table 1, the rice 
and 2+6-row barley poptilations had the highest CV values (7.2 and 
9.0, respectively), while the maize population had the lowest at 1.3. 
The final column in Table 1 shows the ratio n/CV^ relative to the pig 
population (= 1). Although the pig population was nearly 13 times 
the size of the maize population, its CV was 3.6 times larger, leading 
to nearly identical «/CV^ ratios and shrinkage intensities in Figure 2. 
The two structured populations had the smallest n/CV^ ratios and 
thus also the least shrinkage in Figure 2. The 2-row barley popula- 
tion was intermediate between these extremes. 

The shrinkage intensities in Figure 2 were based on minimizing the 
expected MSE, as determined from a reduced marker set. Figure 3 
(using m = 384 markers) shows that this approach did in 
fact minimize the actual MSE between the full-marker relationship 
matrix and that based on the reduced marker set (see Figure S2 for 2 
+6-row barley). The solid lines show the MSE as a function of the 
shrinkage intensity (in 0.05 increments), and in every case, the 
minimum was attained near the value indicated in Figure 2. For 
the rice and 2+6-row barley populations, the minimimi MSE was 



0.7 
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0.5 



0.4 



0.3 



0.2 



0.1 



0.0 
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A IVIaize 
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96 192 384 768 
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Figure 2 Shrinkage intensity to minimize the expected (VISE. Each 
point is the mean from 20 random subsets of markers (SE < 0.01). As 
expected, the optimal shrinkage decreased as the number of markers 
increased. There was little shrinkage for the structured populations 
(rice, 2+6-row barley) because of their high eigenvalue dispersion 
(see CV in Table 1). 
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attained at 5% shrinkage vs. 2-3% shrinkage based on the expected 
MSE. The correspondence was equally good for the unstructured 
populations: 2-row barley = 15% actual vs. 19% expected; pig = 
35% actual vs. 34% expected; maize = 30% actual vs. 32% expected. 

Maximizing accuracy 

Minimizing the MSE, although theoretically tractable, is not in itself 
particularly useful. A more meaningful criterion is maximizing the 
accuracy of breeding value prediction. The dashed curves in Figure 3 
show the effect of shrinkage on prediction acctiracy, as measured by 
the correlation between GEBV (using the shrunken relationship ma- 
trix and all phenotypes for training) and true breeding values simu- 
lated with the fuU marker matrix. The results indicate that shrinkage 
based on minimizing MSE is somewhat conservative with respect to 
maximizing accuracy. This follows from the observation that the 
maximum in the accuracy curve occurred at higher shrinkage than 
where MSE was minimized. For the maize, rice, and pig populations, 
the shrinkage intensity needed to minimize MSE was 0.20-0.25 less 
than for maximizing accuracy. This difference was somewhat smaller 
for the barley populations, but they only had 2K markers for estimat- 
ing the full marker relationship matrix. 

Figure 4 compares GEBV accuracy against phenotypic accuracy in 
the maize population for a range of simulated heritabHities. The three 
curves correspond to (1) using all 44K markers, (2) using a random set 
of 384 SNPs with shrinkage, and (3) using 384 SNPs without shrink- 
age. For all three methods, the maximum GEBV accuracy relative to 
phenotypic accuracy was observed at a phenotypic accuracy of 0.3 
(SE < 0.004). Comparing the two lower curves, one sees that shrinkage 



improved GEBV accuracy with 384 markers, and the accuracy gain 
increased with heritabUity. At a phenotypic accuracy of 0.9, shrink- 
age improved GEBV accuracy by 0.07 on average. 

Figure 4 also illustrates that phenotypic accuracy can be superior to 
GEBV accuracy for highly heritable phenotypes. When phenotypic 
accuracy was above 0.6, it surpassed GEBV accuracy using random 
sets of 384 SNPs without shrinkage, and the crossover with shrinkage 
occurred at phenotypic accuracy equal to 0.8. This phenomenon arises 
because low-density markers sample the genome incompletely, lead- 
ing to discrepancy between the true and estimated relationship ma- 
trices. If the sampling error is large enough, the accuracy of the 
phenotypes is corrupted rather than improved through the mixed 
model analysis. The "sweet spot" for shrinkage in this simulation 
was at phenotypic accuracies between 0.4 and 0.6. In this range, GEBV 
accuracy was substantially improved by shrinkage and was also higher 
than phenotypic accuracy. 

These trends were confirmed by our analysis of three anonymous 
traits in the pig population, for which progeny-test-estimated breeding 
values (pEBV) are available to calculate accuracy (Cleveland et al. 
2012). Table 2 compares the acctiracy of phenotypes, high-density 
SNPs (53K), and low-density SNPs (random sets of 384), both with 
and without shrinkage. The top row for each trait shows the accuracy 
for individuals with measured phenotypes; the bottom row is for 
individuals without a measured phenotype. Looking at the last two 
columns, one sees a clear benefit to using shrinkage for predicting the 
breeding value of phenotyped individuals, and this benefit increased 
with heritabUity. For trait T3 {h^ = 0.38), shrinkage increased 384 SNP 
GEBV accuracy from 0.56 to 0.62, a gain of 0.06 (P < 10"'° by paired 
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Figure 3 Maximizing accuracy vs. minimizing 
MSE. At shrinkage intensities ranging from 0 to 
0.7, with 0.05 increments, the relationship matrix 
was calculated for random sets of 384 markers. In 
each replicate, the MSE was calculated relative to 
the full marker relationship matrix (MSE = 



'IIA3: 



Afuiilp), and GEBV accuracy was es- 



timated using simulated phenotypes. The two 
curves (dashed = accuracy, solid = MSE) show 
the mean from 40 simulations (SE less than 3% 
of the mean). 
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Figure 4 Prediction accuracy for simulated phenotypes in the maize 
population. The three curves show the difference between GEBV 
accuracy and phenotypic accuracy as a function of phenotypic 
accuracy (SE < 0.004 not shown). GEBV accuracy was highest using 
all markers, followed by 384 SNPs with shrinkage. All three prediction 
methods peaked when phenotypic accuracy was 0.3, while the accu- 
racy gain due to shrinkage increased monotonically with phenotypic 
accuracy. Phenotypic accuracies between 0.4 and 0.6 represented 
a "sweet spot" for shrinkage: in this range, heritability was high 
enough for shrinkage to substantially improve GEBV accuracy but 
not so high that phenotypes were more accurate. 



t-test). For traits T4 and T5 {h^ ~ 0.6), the accuracy gain from 
shrinkage was 0.09 and 0.10, respectively, but phenotypic accuracy 
was still higher. With a phenotypic accuracy of 0.58, trait T3 appears 
to be in the sweet spot: GEBV accuracy was improved by shrinkage 
and was also higher than phenotypic accuracy. 

Table 2 shows that shrinkage did not improve GEBV accuracy for 
the unphenotyped pigs, nor have we observed any benefit in simula- 
tions. For example, even with as few as 96 markers, where the gains in 
GEBV accuracy were 0.1-0.2 in the maize population when training 
on all phenotypes, there was no accuracy gain when predicting imphe- 
notyped lines. 

DISCUSSION 

Although we have built upon the work of PoweU et al. (2010) and 
Yang et al. (2010), our results are different from their unified additive 
relationship, or UAR, model. The UAR model assumed a genetic 
model with standardized coefficients for the causal loci 



^ik = — , - _ 
W^Pk<lk 



(27) 



Equation 27 has the undesirable property that the genetic values of 
lines possessing a rare causal allele tend to infinity as the allele fre- 
quency approaches zero. The marker-based estimate of the off-diag- 
onal elements in the UAR matrix is 



VARii 



{Xj^ - 2pk) {Xjk - 2pk) 
'^Pklk 



(28) 



which also tends to infinity for lines possessing a rare marker allele 
as its frequency approaches zero. Such divergent behavior does not 
occur in the estimators we have derived. Our formula for high-density 
markers (Equation 15) is identical to the first formula proposed by 
VanRaden (2008) for use with an imselected, outbred base population. 
Our IBS derivation provides rigorous justification for using this for- 
mula in any population when the mmiber of markers is much larger 
than the nimiber of lines. 

In the nimierator relationship matrfac, each diagonal element 
equals one plus the probability that the two alleles at a randomly 
chosen locus are IBD from the base population. As this probability lies 
between 0 and 1, the diagonal elements in the numerator relationship 
matrix range from 1 to 2. It was evident from Figure 1 that the 
diagonal elements in the realized relationship matrix can fall outside 
this range. In the UAR model, the diagonal elements have been mod- 
ified to lie in the range 0-2 (Yang et al. 2010; Powell et al. 2010), but 
this has the effect of creating an improper covariance matrix for the 
breeding values {i.e. it may no longer be positive semidefinite). 

From the formula for the realized relationship matrix in Equation 
13, the analog to the inbreeding coefficient for an individual is 



(Pi 



Eki^ik-^Pk) 



(29) 



To gain insight into this formula, note that if p = 1/2 for all loci. 
Equation 29 simplifies to 2i/f — 1, where ifj is the fraction of homo- 
zygous loci. In the context of an IBS model, homozygosity is an 
appropriate state quantity for measuring the inbreeding of an individ- 
ual. The overall inbreeding coefficient / can be written as an average 
over individuals or over loci 



■ Table 2 


Prediction accuracies for pig traits 








Trait 








GEBV"^ Accuracy 


GEBV Accuracy 


GEBV Accuracy 


n 


Phenotypic Accuracy'' 


53K SNP 


384 SNP -1- Shrinkage 


384 SNP, No Shrinkage 


T3 


0.38 


3^A■^d 


0.580 


0.690 


0.617 (0.002)"= 


0.561 (0.002) 






393 




0.465 


0.370 (0.007) 


0.370 (0.007) 


T4 


0.58 


3152 


0.751 


0.809 


0.718 (0.002) 


0.630 (0.002) 






382 




0.569 


0.469 (0.004) 


0.469 (0.004) 


T5 


0.62 


3184 


0.734 


0.765 


0.678 (0.003) 


0.584 (0.003) 






350 




0.520 


0.429 (0.012) 


0.429 (0.012) 



Heritability reported by Cleveland et al. (2012). 
Accuracy = correlation with progeny-test-estimated breeding values. 
^Genomic-estimated breeding values (GEBV) calculated using all phenotyped individuals. 
Within each trait, the top row is for individuals with a measured phenotype; the bottom row is for individuals without a phenotype. 
Mean and SE based on 20 random sets of 384 markers. 
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(30) 



k=l 



where the weights j8j: are given in Equation 12. 

Because the allele content at each locus is centered by the 
population mean, our realized relationship matrix is positive semi- 
definite but not strictly positive definite (there is at least one zero 
eigenvalue). This means the breeding values follow a singular normal 
distribution, but this poses no problem fi-om the perspective of mixed 
model theory (Searle et al. 1992). 

Heritability 

When the genetic covariance is written as proportional to the 
numerator relationship matrix, the proportionality constant is the 
additive genetic variance in the outbred base population. Because the 
IBS-relationship matrix uses the current population as the "base," 
one might expect its proportionality constant, cr^ = 2o-^(pj.ijj.) 
(Equation 11), to equal the genetic variance of the current population, 
but this is not true for inbred Unes. As originally shown by Fisher 
(1941) [see also Kempthorne (1957) and Lynch and Walsh (1998)], 
the additive genetic variance for a single locus with no dominance is 
a\ = 2a\pq(l +/). Compared with the coefficient of the relationship 
matrix, the additive genetic variance is larger by a factor of (1+/). 

This fact has implications for estimating heritability in the narrow 
sense. If the additive genetic values in the mixed model are breeding 
values {i.e. twice the mean progeny value; see Appendix i), heritability 
can be defined using parent-offspring regression as 



cov, [fli,yi] _ a'(y-/xl) 



varj [yi\ 



lly-/^i|| 



(31) 



Replacing the breeding values (and phenotypic mean jx) in Equa- 
tion 31 with their predicted values provides an immediate estimator 
for heritability. By taking the expected value of Equation 31, herita- 
bility can be related to the variance components of the mixed model 
(Equation 26). In File SI, we show that for large populations 



E[h^ 



'(1+/) 



(32) 



Equation 32 can also be used to estimate by replacing the 
variance components with their ML or REML estimates. 

Shrinkage 

Yang et al. (2010) proposed using the identity matrix as a low-di- 
mensional target when shrinking the estimate of the relationship ma- 
trix: A = SI + (1 — S)A. For inbred populations, this estimator is 
not ideal because it shrinks the off-diagonal and diagonal elements 
with the same intensity. By contrast, our estimator does not shrink the 
inbreeding coefficient. 

Using both real and simulated phenotypes, we have demonstrated 
that shrinkage can substantially increase the accuracy of GEBVs for 
phenotyped individuals (or lines), but not for unphenotyped ones. 
Although the term "genomic selection" is typically used in the context 
of predicting unphenotyped individuals, it is also encompasses the 
selection of phenotyped individuals for mating based on GEBV, which 
is important in plant and animal breeding. In plant breeding, we also 
see potential to use the realized relationship matrix with single-replicate 



or unbalanced multi-environment yield trials to more accurately ad- 
vance Unes for variety or hybrid development, and shrinkage may be 
beneficial in these applications. 

Conclusion 

There were two objectives in this study. The first was to formulate the 
realized relationship matrix based on identity-by-state at causal loci 
and by requiring the mean diagonal element to equal 1+/ for the 
current population. For high-density markers, the optimal estimator 
of this relationship matrix is equivalent to the first formula of Van- 
Raden (2008). The second objective was to explore shrinkage estima- 
tion of the relationship matrix at low marker density. In unstructured 
populations with more lines than markers, shrinkage estimation can 
increase the accuracy of GEBVs for phenotyped lines; there is no 
benefit without phenotypes. Particularly when phenotypes have mod- 
erate accuracy, e.g. from preliminary yield trials in plant breeding, 
shrinkage estimation has the potential to improve the selection of 
lines as parents or for variety development. 
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Appendix 1 

We prove tJiat, for an additive trait in an arbitrary population, the 
formula in Equation 4 for tlie genetic value of line i 



« 'E/(**i -pk)[xiki -pk) 



pkqk 



also quantifies the deviation from Hardy- Weinberg proportions. 
For allele content X,- e {0,1,2}, the proportion of heterozygotes (H) is 



equals two times the mean genetic value of its progeny under 
random mating, expressed relative to the population mean. In other 
words, the additive genetic value is the breeding value. By assuming 
the genetic effects u^. are physiological parameters of the causal loci 
(rather than defined in a least-squares sense), we can write the prog- 
eny genetic value as gij = '^j^Z/kUk where Zjk is the allele dosage at 
locus k in progeny j. The mean progeny value for line i is thus 



(33) 



where the expectation is with respect to the random processes 
of gamete segregation and mating. The former contributes iX,j: and 
the latter contributes pf^ for a mean progeny value of ijp^ik + ft) 
Subtracting the population mean 2'}2kPk'^k and multiplying by two 
produces Equation 4. Note that Equation 33 does not require the causal 
loci to be in linkage equiUbriimi (the linearity of the expectation oper- 
ator does not require statistical independence). 

Appendix 2 

We prove that for a single locus, the inbreeding coefficient defined by 
Equation 10 (written here without overscript tildes) 



ff=n-'^Xi(2-X,) 



(34) 



Upon substituting X, = x,i -i- into Equation 34, where and Xi2 
are binary variables denoting the two gametes in individual /, the 
result is 



(35) 



H ^ n {xii + Xi2){2 - Xii - Xi2) 

i=\ 

= + Xi2 - 2XiiXi2) 

= 2p-2n-^J2iixii -p+p){xi2-p+p) 
= 2p- 2p^ - 2pqf 
= 2pq(l -/) 



Similarly, the proportion of X, = 2 homozygotes is (omitting sev- 
eral steps) 



1=1 

= n"' YjiXiiXi2 
= pqf + p^ 



(36) 
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